
    A Corpus-Based Analysis of the Language Used by Defendants of Homicide in Court

    In this study we present the updated version of the Greek Corpus of Defendants’ Testimonies (GCDT) together with a series of new evaluations carried out on the defendants’ speech. Using criteria such as lexical richness, lexical density, part-of-speech frequencies, and word and sentence length, we look for linguistic features that could characterize the stylometric profile of the defendants. We also present GCWT, a reference corpus constructed to be comparable to GCDT in its stylistic features; GCWT contains witnesses’ testimonies collected in court.
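
    As a rough illustration of the feature types listed above, the following Python sketch computes the type-token ratio and word/sentence length statistics on a hypothetical snippet; lexical density and part-of-speech frequencies would additionally require a (Greek) POS tagger, and nothing here reproduces the GCDT setup.

        import re

        def stylometric_profile(text):
            # Sentence and token segmentation kept deliberately simple for the example.
            sentences = [s for s in re.split(r"[.!?;]+", text) if s.strip()]
            tokens = re.findall(r"\w+", text.lower())
            return {
                "lexical_richness": len(set(tokens)) / len(tokens),   # type-token ratio
                "avg_word_length": sum(map(len, tokens)) / len(tokens),
                "avg_sentence_length": len(tokens) / len(sentences),  # tokens per sentence
            }

        # Hypothetical snippet of a testimony, not taken from GCDT.
        print(stylometric_profile("I did not plan anything. It happened before I could think."))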

    Combining Thesaurus Knowledge and Probabilistic Topic Models

    In this paper we present an approach for introducing thesaurus knowledge into probabilistic topic models. The main idea is based on the assumption that the frequencies of semantically related words and phrases that occur in the same texts should be enhanced, which increases their contribution to the topics found in those texts. We have conducted experiments with several thesauri and found that domain-specific knowledge is useful for improving topic models. If a general thesaurus such as WordNet is used, the thesaurus-based improvement of topic models can be achieved by excluding hyponymy relations in the combined topic models. Comment: Accepted to the AIST-2017 conference (http://aistconf.ru/). The final publication will be available at link.springer.co
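
    The core idea can be sketched as follows: before fitting a topic model, the counts of thesaurus-related words that appear in the same text are boosted. The thesaurus pairs and boost factor below are made up for illustration and are not the paper's weighting scheme.

        from collections import Counter

        # Hypothetical related-word pairs and boost factor (not the paper's actual scheme).
        THESAURUS_PAIRS = {("car", "vehicle"), ("engine", "motor")}
        BOOST = 2.0

        def boosted_counts(tokens):
            counts = Counter(tokens)
            for a, b in THESAURUS_PAIRS:
                if a in counts and b in counts:          # related words met in the same text
                    counts[a] *= BOOST                   # enhance their contribution
                    counts[b] *= BOOST
            return counts                                # feed these weights to the topic model

        print(boosted_counts("the car broke down because the engine and the motor mount failed".split()))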

    Yet Another Ranking Function for Automatic Multiword Term Extraction

    Term extraction is an essential task in domain knowledge acquisition. We propose two new measures for extracting multiword terms from domain-specific text. The first measure combines linguistic and statistical information; the second is graph-based and assesses the importance of a multiword term within a domain. Existing measures often address some, but not all, of the problems related to term extraction, e.g. noise, silence, low frequency, large corpora, and the complexity of the multiword term extraction process. Instead, we focus on managing the entire set of problems, e.g. detecting rare terms and overcoming the low-frequency issue. We show that the two proposed measures outperform previously reported precision results for automatic multiword term extraction by comparing them with state-of-the-art reference measures.
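
    For context, a minimal graph-based ranking in the TextRank family is sketched below: candidate multiword terms are scored by the centrality of their component words in a word co-occurrence graph. This is a generic example, not the specific measures proposed in the paper.

        import itertools
        import networkx as nx

        # Toy tokenized sentences and candidate multiword terms.
        sentences = [["automatic", "multiword", "term", "extraction"],
                     ["ranking", "multiword", "term", "candidates"],
                     ["graph", "based", "term", "ranking"]]
        candidates = [("multiword", "term"), ("term", "extraction"), ("graph", "based", "ranking")]

        graph = nx.Graph()
        for sent in sentences:
            graph.add_edges_from(itertools.combinations(set(sent), 2))  # sentence co-occurrence

        centrality = nx.pagerank(graph)
        scores = {c: sum(centrality.get(w, 0.0) for w in c) / len(c) for c in candidates}
        for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
            print(" ".join(term), round(score, 3))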

    A Cross-Lingual Similarity Measure for Detecting Biomedical Term Translations

    Bilingual dictionaries for technical terms such as biomedical terms are an important resource for machine translation systems as well as for humans who would like to understand a concept described in a foreign language. Often a biomedical term is first proposed in English and later manually translated to other languages. Although there are large monolingual lexicons of biomedical terms, only a fraction of those term lexicons are translated to other languages. Manually compiling large-scale bilingual dictionaries for technical domains is challenging because it is difficult to find a sufficiently large number of bilingual experts. We propose a cross-lingual similarity measure for detecting the most similar translation candidates for a biomedical term specified in one language (source) from another language (target). Specifically, a biomedical term in a language is represented using two types of features: (a) intrinsic features, consisting of character n-grams extracted from the term under consideration, and (b) extrinsic features, consisting of unigrams and bigrams extracted from the contextual windows surrounding the term under consideration. We propose a cross-lingual similarity measure using each of those feature types. First, to reduce the dimensionality of the feature space in each language, we propose prototype vector projection (PVP), a non-negative lower-dimensional vector projection method. Second, we propose a method to learn a mapping between the feature spaces in the source and target languages using partial least squares regression (PLSR). The proposed method requires only a small number of training instances to learn a cross-lingual similarity measure. The proposed PVP method outperforms popular dimensionality reduction methods such as singular value decomposition (SVD) and non-negative matrix factorization (NMF) in a nearest neighbor prediction task. Moreover, our experimental results covering several language pairs, such as English–French, English–Spanish, English–Greek, and English–Japanese, show that the proposed method outperforms several other feature projection methods in biomedical term translation prediction tasks.
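
    The PLSR mapping step can be sketched with scikit-learn as below, using only character n-gram (intrinsic) features and a toy seed dictionary; the PVP projection and the extrinsic context features described in the paper are omitted.

        import numpy as np
        from sklearn.cross_decomposition import PLSRegression
        from sklearn.feature_extraction.text import CountVectorizer

        # Toy seed dictionary of aligned English-French biomedical terms.
        en_terms = ["insulin", "aspirin", "glucose", "protein"]
        fr_terms = ["insuline", "aspirine", "glucose", "proteine"]

        vec_en = CountVectorizer(analyzer="char", ngram_range=(2, 3))
        vec_fr = CountVectorizer(analyzer="char", ngram_range=(2, 3))
        X = vec_en.fit_transform(en_terms).toarray()   # intrinsic (character n-gram) features
        Y = vec_fr.fit_transform(fr_terms).toarray()

        pls = PLSRegression(n_components=2).fit(X, Y)  # learn the cross-lingual mapping

        query = vec_en.transform(["insulin"]).toarray()
        projected = pls.predict(query)                 # query mapped into the French feature space
        similarities = Y @ projected.T                 # score each French term against the projection
        print(fr_terms[int(np.argmax(similarities))])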

    Facilitating the development of controlled vocabularies for metabolomics technologies with text mining

    BACKGROUND: Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently interpret and seamlessly integrate information scattered across public resources. Experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. However, constructing these resources manually is time-consuming and non-trivial. RESULTS: We describe a methodology for the rapid development of controlled vocabularies, a study originally motivated by the need for vocabularies describing metabolomics technologies. We present case studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and gas chromatography) whose development is currently underway as part of the Metabolomics Standards Initiative. The initial vocabularies were compiled manually, providing totals of 243 and 152 terms, respectively. A further 5,699 and 2,612 new terms were acquired automatically from the literature. The analysis of the results showed that full-text articles (especially the Materials and Methods sections), as opposed to paper abstracts, are the major source of technology-specific terms. CONCLUSIONS: We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources, for the time- and cost-effective development of a text mining tool that expands controlled vocabularies across various domains, as a practical alternative to both manual term collection and tailor-made named entity recognition methods.
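
    A generic corpus-comparison heuristic of the kind used for corpus-based term acquisition is sketched below (a simple frequency-ratio, or "weirdness", score); it is only illustrative and is not the tool described in the paper.

        from collections import Counter

        def term_candidates(domain_tokens, background_tokens, min_ratio=2.0):
            dom, bg = Counter(domain_tokens), Counter(background_tokens)
            n_dom, n_bg = sum(dom.values()), sum(bg.values())
            scored = {}
            for word, count in dom.items():
                ratio = (count / n_dom) / ((bg[word] + 1) / n_bg)   # add-one smoothing
                if ratio >= min_ratio:
                    scored[word] = ratio
            return sorted(scored, key=scored.get, reverse=True)

        # Toy Materials-and-Methods fragment versus a generic background fragment.
        domain = "nmr spectra were acquired with a cryogenic nmr probe after automatic shimming".split()
        background = ("the results were analysed and discussed in detail and the main findings of "
                      "the study were summarised in a table with a short comment on each of the "
                      "outcomes in the final section of the report").split()
        print(term_candidates(domain, background))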

    Supporting systematic reviews using LDA-based document representations

    BACKGROUND: Identifying relevant studies for inclusion in a systematic review (i.e. screening) is a complex, laborious and expensive task. Recently, a number of studies have shown that the use of machine learning and text mining methods to automatically identify relevant studies has the potential to drastically decrease the workload involved in the screening phase. The vast majority of these machine learning methods exploit the same underlying principle: a study is modelled as a bag of words (BOW). METHODS: We explore the use of topic modelling methods to derive a more informative representation of studies. We apply latent Dirichlet allocation (LDA), an unsupervised topic modelling approach, to automatically identify topics in a collection of studies, and then represent each study as a distribution over the LDA topics. Additionally, we enrich the topics derived using LDA with multi-word terms identified by an automatic term recognition (ATR) tool. For evaluation purposes, we carry out automatic identification of relevant studies using support vector machine (SVM)-based classifiers that employ both our novel topic-based representation and the BOW representation. RESULTS: Our results show that the SVM classifier identifies a greater number of relevant studies when using the LDA representation than the BOW representation. These observations hold for two systematic reviews from the clinical domain and three from the social science domain. CONCLUSIONS: A topic-based feature representation of documents outperforms the BOW representation when applied to the task of automatic citation screening. The proposed term-enriched topics are more informative and less ambiguous to systematic reviewers. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13643-015-0117-0) contains supplementary material, which is available to authorized users.
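
    The LDA-then-SVM idea can be sketched with scikit-learn as follows, on toy data; the paper's term-enriched topics and review corpora are not reproduced here.

        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.svm import LinearSVC

        docs = ["randomised controlled trial of statin therapy",
                "cohort study of antihypertensive treatment",
                "qualitative interviews on housing policy",
                "survey of local social care provision"]
        labels = [1, 1, 0, 0]                          # 1 = relevant to the (toy) review

        vectorizer = CountVectorizer()
        counts = vectorizer.fit_transform(docs)        # plain bag-of-words counts
        lda = LatentDirichletAllocation(n_components=2, random_state=0)
        topic_repr = lda.fit_transform(counts)         # each study as a distribution over topics

        classifier = LinearSVC().fit(topic_repr, labels)
        new_study = vectorizer.transform(["pilot trial of statin dosing"])
        print(classifier.predict(lda.transform(new_study)))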

    Automatic term identification for bibliometric mapping

    A term map visualizes the structure of a scientific field by showing the relations between important terms in the field. The terms shown in a term map are usually selected manually with the help of domain experts. Manual term selection has the disadvantages of being subjective and labor-intensive. To overcome these disadvantages, we propose a methodology for automatic term identification and use it to select the terms to be included in a term map. To evaluate the proposed methodology, we use it to construct a term map of the field of operations research. The quality of the map is assessed by a number of operations research experts. It turns out that, in general, the proposed methodology performs quite well.
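
    A term map in miniature can be sketched as below: terms are nodes, document co-occurrences are weighted edges, and a force-directed layout places related terms near each other. This is a generic illustration, not the paper's term identification methodology.

        import itertools
        import networkx as nx

        # Toy operations-research documents, each reduced to its identified terms.
        doc_terms = [{"linear programming", "simplex method"},
                     {"queueing theory", "simulation"},
                     {"linear programming", "integer programming"},
                     {"queueing theory", "simulation", "scheduling"}]

        graph = nx.Graph()
        for terms in doc_terms:
            for a, b in itertools.combinations(sorted(terms), 2):
                weight = graph.get_edge_data(a, b, {"weight": 0})["weight"]
                graph.add_edge(a, b, weight=weight + 1)     # co-occurrence count as edge weight

        positions = nx.spring_layout(graph, weight="weight", seed=0)  # 2-D map coordinates
        for term, (x, y) in positions.items():
            print(f"{term:22s} {x:+.2f} {y:+.2f}")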

    EnvMine: A text-mining system for the automatic extraction of contextual information

    BACKGROUND: For ecological studies, it is crucial to have adequate descriptions of the environments and samples being studied. Such descriptions must be given in terms of physicochemical characteristics, allowing direct comparisons between different environments that would otherwise be difficult. The characterization must also include the precise geographical location, to make possible the study of geographical distributions and biogeographical patterns. Currently there is no schema for annotating these environmental features, so the data have to be extracted from textual sources (published articles), which so far has been done by manual inspection of the corresponding documents. To facilitate this task, we have developed EnvMine, a set of text-mining tools devoted to retrieving contextual information (physicochemical variables and geographical locations) from textual sources of any kind. RESULTS: EnvMine is capable of retrieving the physicochemical variables cited in the text by accurately identifying their associated units of measurement. In this task, the system achieves a recall (percentage of items retrieved) of 92% with less than 1% error. A Bayesian classifier was also tested for distinguishing parts of the text that describe environmental characteristics from those dealing with, for instance, experimental settings. Regarding the identification of geographical locations, the system takes advantage of existing databases such as GeoNames to achieve 86% recall with 92% precision. The identification of a location also includes the determination of its exact coordinates (latitude and longitude), thus allowing the calculation of distances between individual locations. CONCLUSION: EnvMine is a very efficient method for extracting contextual information from different text sources, such as published articles or web pages. This tool can help determine the precise location and physicochemical variables of sampling sites, thus facilitating ecological analyses. EnvMine can also help in the development of standards for the annotation of environmental features.
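
    Two illustrative pieces of such a pipeline are sketched below (not EnvMine's actual rules): extracting value-unit pairs for physicochemical variables with a regular expression, and computing the distance between two geo-referenced sampling sites.

        import math
        import re

        # Hypothetical value-unit pattern; a real system would cover far more units.
        UNIT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(°C|g/L|mM|m)\b")

        sentence = "Samples were taken at 4 m depth, 12.5 °C and 0.8 g/L salinity."
        print(UNIT_PATTERN.findall(sentence))    # value-unit pairs for physicochemical variables

        def haversine_km(lat1, lon1, lat2, lon2):
            """Great-circle distance in km between two latitude/longitude points."""
            radius = 6371.0                      # mean Earth radius in km
            p1, p2 = math.radians(lat1), math.radians(lat2)
            dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
            a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
            return 2 * radius * math.asin(math.sqrt(a))

        print(round(haversine_km(40.4168, -3.7038, 41.3874, 2.1686)))  # Madrid to Barcelona, roughly 500 km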